regularization property
Regularization properties of adversarially-trained linear regression
State-of-the-art machine learning models can be vulnerable to very small, adversarially constructed input perturbations. Adversarial training is an effective approach to defend against such attacks. Formulated as a min-max problem, it searches for the best solution when the training data are corrupted by worst-case attacks. Linear models are among the simple models where these vulnerabilities can be observed, and they are the focus of our study. In this case, adversarial training leads to a convex optimization problem that can be formulated as the minimization of a finite sum. We provide a comparative analysis between the solution of adversarial training in linear regression and other regularization methods. Our main findings are that: (A) Adversarial training yields the minimum-norm interpolating solution in the overparameterized regime (more parameters than data), as long as the maximum disturbance radius is smaller than a threshold. Conversely, the minimum-norm interpolator is the solution to adversarial training with a given radius.
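The min-max formulation mentioned in this abstract can be made concrete for a single sample. The sketch below assumes a squared-error loss and an attack bounded in the l2 norm with radius delta (the abstract does not fix the norm, so this choice is an assumption). Under that assumption the inner maximization has the standard closed form (|y - x^T beta| + delta * ||beta||_2)^2, which is why the outer problem reduces to minimizing a finite sum of convex terms. All function names are hypothetical and chosen for illustration only.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(dot(v, v))


def adversarial_sq_loss(x, y, beta, delta):
    """Closed-form worst-case squared error under an l2 attack of radius delta.

    Assumption: the attacker may add any dx with ||dx||_2 <= delta to x.
    """
    return (abs(y - dot(x, beta)) + delta * l2_norm(beta)) ** 2


def worst_case_perturbation(x, y, beta, delta):
    """A perturbation that attains the maximum of the inner problem."""
    norm = l2_norm(beta)
    if norm == 0.0:
        # The prediction does not depend on x, so no attack changes the loss.
        return [0.0] * len(x)
    residual = y - dot(x, beta)
    sign = 1.0 if residual >= 0 else -1.0
    # Moving x along -sign * beta pushes the prediction away from y.
    return [-sign * delta * b / norm for b in beta]


def perturbed_sq_loss(x, dx, y, beta):
    """Squared error of the linear model evaluated at the perturbed input x + dx."""
    x_adv = [a + d for a, d in zip(x, dx)]
    return (y - dot(x_adv, beta)) ** 2


if __name__ == "__main__":
    x, y = [1.0, -2.0, 0.5], 0.7
    beta, delta = [0.3, 0.1, -0.4], 0.1
    dx = worst_case_perturbation(x, y, beta, delta)
    print("closed form :", adversarial_sq_loss(x, y, beta, delta))
    print("attained by :", perturbed_sq_loss(x, dx, y, beta))
```

Summing adversarial_sq_loss over the training set gives the adversarial training objective under these assumptions; no explicit search over perturbations is needed.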
On Approximation in Deep Convolutional Networks: a Kernel Perspective
Deep convolutional models have been at the heart of the recent successes of deep learning in problems where the data consists of high-dimensional signals, such as image classification or speech recognition. The convolution and pooling operations in these architectures are known to be crucial for their practical success, yet our theoretical understanding of how they enable efficient learning is still limited. One key difficulty for understanding such models is the curse of dimensionality: due to the high dimensionality of the input data, it is hopeless to learn arbitrary functions from samples. For instance, classical non-parametric regression techniques for approximating Lipschitz or Sobolev functions typically require either low dimension or an order of smoothness of the target function comparable to the dimension in order to obtain good generalization (e.g., Wainwright, 2019), which is a very strong assumption when dealing with high-dimensional signals. Thus, further assumptions on the target function are needed to make the problem more tractable, in a way that makes convolutions a useful modeling tool. Various works have studied approximation benefits with models that resemble deep convolutional architectures, for instance through hierarchical models with local connectivity (Mhaskar and Poggio, 2016; Schmidt-Hieber et al., 2020), or through structured tensor decompositions (Cohen and Shashua, 2017). Nevertheless, while such function classes may provide improved statistical efficiency, it is unclear if they can be learned with computationally efficient algorithms, which makes it difficult to assess the validity of these approximation models empirically. In order to overcome the computational difficulties, we provide a different perspective based on kernel methods (e.g., Schölkopf and Smola, 2001; Wahba, 1990), which are known to be computationally tractable with well-understood statistical and approximation properties.
In particular, we consider "deep" structured kernels known as convolutional kernels, which have produced good empirical performance on standard
On the Regularization Properties of Structured Dropout
Pal, Ambar, Lane, Connor, Vidal, René, Haeffele, Benjamin D.
Dropout and its extensions (e.g., DropBlock and DropConnect) are popular heuristics for training neural networks, which have been shown to improve generalization performance in practice. However, a theoretical understanding of their optimization and regularization properties remains elusive. Recent work shows that in the case of single hidden-layer linear networks, Dropout is a stochastic gradient descent method for minimizing a regularized loss, and that the regularizer induces solutions that are low-rank and balanced. In this work we show that for single hidden-layer linear networks, DropBlock induces spectral k-support norm regularization, and promotes solutions that are low-rank and have factors with equal norm. We also show that the global minimizer for DropBlock can be computed in closed form, and that DropConnect is equivalent to Dropout. We then show that some of these results can be extended to a general class of Dropout strategies, and, with some assumptions, to deep non-linear networks when Dropout is applied to the last layer. We verify our theoretical claims and assumptions experimentally with commonly used network architectures.
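The statement that Dropout minimizes a regularized loss can be checked numerically in the simplest setting. The sketch below assumes a single hidden-layer linear network with scalar output, independent Bernoulli(theta) masks on the hidden units, and the usual 1/theta rescaling (all modeling choices here are assumptions made for illustration, and the function names are hypothetical). It averages the squared error exactly over all masks and compares the result with the plain squared error plus the penalty ((1 - theta) / theta) * sum_i (u_i * v_i^T x)^2, which follows from the mean and variance of the rescaled mask. It covers plain Dropout only, not DropBlock or DropConnect.

```python
from itertools import product


def hidden_activations(V, x):
    """Hidden-unit values v_i^T x for each row v_i of V."""
    return [sum(vij * xj for vij, xj in zip(row, x)) for row in V]


def dropout_expected_loss(u, V, x, y, theta):
    """Exact expectation of the squared error over all Bernoulli(theta) masks."""
    h = hidden_activations(V, x)
    total = 0.0
    for mask in product([0, 1], repeat=len(u)):
        # Probability of this particular keep/drop pattern.
        prob = 1.0
        for r in mask:
            prob *= theta if r else (1.0 - theta)
        # Kept units are rescaled by 1/theta so the prediction is unbiased.
        pred = sum(r * ui * hi for r, ui, hi in zip(mask, u, h)) / theta
        total += prob * (y - pred) ** 2
    return total


def regularized_loss(u, V, x, y, theta):
    """Deterministic squared error plus the Dropout-induced penalty."""
    h = hidden_activations(V, x)
    pred = sum(ui * hi for ui, hi in zip(u, h))
    penalty = (1.0 - theta) / theta * sum((ui * hi) ** 2 for ui, hi in zip(u, h))
    return (y - pred) ** 2 + penalty


if __name__ == "__main__":
    u = [0.5, -1.0, 2.0]
    V = [[1.0, 0.0], [0.5, -0.5], [0.2, 0.3]]
    x, y, theta = [1.0, 2.0], 0.4, 0.8
    print(dropout_expected_loss(u, V, x, y, theta))
    print(regularized_loss(u, V, x, y, theta))
```

The enumeration costs 2^d evaluations for d hidden units, so it is only meant as a sanity check on small examples.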
On Regularization Properties of Artificial Datasets for Deep Learning
In this paper, we have presented analogies between the regularization methods for deep learning and the data augmentation process interpreted as noise injection. It was shown that, by generating the input data from high-level features, it is possible to regularize hidden layers of the network by exploiting the ability of deep networks to learn hierarchical representations. The analysis given here is theoretical, but there already are experimental results that partially confirm these observations. A case of convolutional neural networks for stenosis detection [14] has shown that pretraining the network on an artificial dataset results in a reduction of the test error rate on the real dataset, and, thus, a smaller generalization gap. An improvement of test accuracy was also observed in the case of recurrent neural networks for ECG filtering, pretrained with synthetic signals [15]. A more definitive confirmation should be expected from the comparison of models trained for the same task with datasets created by injecting noise either into input features or high-level features of the real data.
Implicit Regularization of Accelerated Methods in Hilbert Spaces
Pagliana, Nicolò, Rosasco, Lorenzo
We study learning properties of accelerated gradient descent methods for linear least-squares in Hilbert spaces. We analyze the implicit regularization properties of Nesterov acceleration and a variant of heavy-ball in terms of corresponding learning error bounds. Our results show that acceleration can provide faster bias decay than gradient descent, but also suffers from more unstable behavior. As a result, acceleration cannot in general be expected to improve learning accuracy with respect to gradient descent, but rather to achieve the same accuracy with reduced computations. Our theoretical results are validated by numerical simulations. Our analysis is based on studying suitable polynomials induced by the accelerated dynamics and combining spectral techniques with concentration inequalities.
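The claim about faster bias decay can be illustrated on a toy problem. The sketch below is not the paper's Hilbert-space setting: it assumes a noiseless, two-dimensional least-squares objective with a diagonal Hessian, a step size of 1/L, and the common (k - 1) / (k + 2) momentum schedule for Nesterov acceleration (all of these are assumptions, and the function names are hypothetical). Because there is no noise, it shows only the bias side of the trade-off, not the instability the abstract also mentions.

```python
def quadratic_loss(w, eigenvalues, w_star):
    """0.5 * sum_i lambda_i * (w_i - w*_i)^2, a diagonal least-squares objective."""
    return 0.5 * sum(l * (wi - ws) ** 2 for wi, l, ws in zip(w, eigenvalues, w_star))


def gradient_step(w, eigenvalues, w_star, step_size):
    """One plain gradient step on the diagonal quadratic."""
    return [wi - step_size * l * (wi - ws) for wi, l, ws in zip(w, eigenvalues, w_star)]


def run_gd(eigenvalues, w_star, steps, step_size):
    """Gradient descent started from the origin."""
    w = [0.0] * len(eigenvalues)
    for _ in range(steps):
        w = gradient_step(w, eigenvalues, w_star, step_size)
    return w


def run_nesterov(eigenvalues, w_star, steps, step_size):
    """Nesterov acceleration with momentum (k - 1) / (k + 2), started from the origin."""
    w = [0.0] * len(eigenvalues)
    w_prev = list(w)
    for k in range(1, steps + 1):
        momentum = (k - 1) / (k + 2)
        # Extrapolate along the previous displacement, then take a gradient step.
        v = [wi + momentum * (wi - wp) for wi, wp in zip(w, w_prev)]
        w_prev = w
        w = gradient_step(v, eigenvalues, w_star, step_size)
    return w


if __name__ == "__main__":
    eigenvalues = [1.0, 0.01]   # ill-conditioned: the second direction is slow
    w_star = [1.0, 1.0]         # noiseless target
    step_size = 1.0             # 1 / L with L = 1
    for steps in (10, 50, 100):
        gd = quadratic_loss(run_gd(eigenvalues, w_star, steps, step_size), eigenvalues, w_star)
        nag = quadratic_loss(run_nesterov(eigenvalues, w_star, steps, step_size), eigenvalues, w_star)
        print(steps, gd, nag)
```

With noisy targets, the iteration count acts as the regularization parameter, so a comparison of learning accuracy would also need to track the variance term that this toy example omits.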
Identifying global optimality for dictionary learning
Learning new representations of input observations in machine learning is often tackled using a factorization of the data. For many such problems, including sparse coding and matrix completion, learning these factorizations can be difficult, both in terms of efficiency and in guaranteeing that the solution is a global minimum. Recently, a general class of objectives has been introduced, which we term induced dictionary learning models (DLMs), that have an induced convex form that enables global optimization. Though attractive theoretically, this induced form is impractical, particularly for large or growing datasets. In this work, we investigate the use of practical alternating minimization algorithms for induced DLMs that ensure convergence to global optima. We characterize the stationary points of these models and, using these insights, highlight practical choices for the objectives. We then provide theoretical and empirical evidence that alternating minimization, from a random initialization, converges to global minima for a large subclass of induced DLMs. In particular, we take advantage of the existence of the (potentially unknown) convex induced form to identify when stationary points are global minima for the dictionary learning objective. We then provide an empirical investigation into practical optimization choices for using alternating minimization for induced DLMs, for both batch and stochastic gradient descent.